In this demo, we’ll first work with a dataset on the number of PhD degrees awarded in the US from TidyTuesday.
# Read in the tidytuesday data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
phd_field <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv")
## Rows: 3370 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): broad_field, major_field, field
## dbl (2): year, n_phds
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
phd_field
## # A tibble: 3,370 × 5
## broad_field major_field field year n_phds
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Life sciences Agricultural sciences and natural resources Agric… 2008 111
## 2 Life sciences Agricultural sciences and natural resources Agric… 2008 28
## 3 Life sciences Agricultural sciences and natural resources Agric… 2008 3
## 4 Life sciences Agricultural sciences and natural resources Agron… 2008 68
## 5 Life sciences Agricultural sciences and natural resources Anima… 2008 41
## 6 Life sciences Agricultural sciences and natural resources Anima… 2008 18
## 7 Life sciences Agricultural sciences and natural resources Anima… 2008 77
## 8 Life sciences Agricultural sciences and natural resources Envir… 2008 182
## 9 Life sciences Agricultural sciences and natural resources Fishi… 2008 52
## 10 Life sciences Agricultural sciences and natural resources Food … 2008 96
## # … with 3,360 more rows
Let’s start by grabbing the rows corresponding to Statistics PhDs.
While there are a number of ways to do this, we can grab
field containing “statistics” (including biostatistics)
with the str_detect() function.
stats_phds <- phd_field %>%
filter(str_detect(tolower(field), "statistics"))
What are the different fields that were captured?
table(stats_phds$field)
##
## Biometrics and biostatistics
## 10
## Educational statistics, research methods
## 10
## Management information systems, business statistics
## 10
## Mathematics and statistics, general
## 10
## Mathematics and statistics, other
## 10
## Statistics (mathematics)
## 10
## Statistics (social sciences)
## 10
To start, let’s just summarize the number of PhDs by
year:
stat_phd_year_summary <- stats_phds %>%
group_by(year) %>%
summarize(n_phds = sum(n_phds))
Now, we’ll make the typical scatterplot display with
n_phds on the y-axis and year on the
x-axis:
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
geom_point() +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
We should fix our x-axis here and make the breaks more informative. In this case, I’ll change it so each year is labeled (that may not be appropriate for every visual but it works out here).
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
geom_point() +
# Modify the x-axis to make the axis breaks at the unique years and show their
# respective labels
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
To emphasize the ordering of the year along the x-axis, we’ll add a line connecting the points to emphasize the order:
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
We can drop the points, leaving only the connecting lines to emphasize trends:
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
Another common way to display trends is by filling in the area under the
line. However, this is only appropriate when the y-axis starts at 0!
It’s also redundant use of ink so just be careful when deciding whether
or not to fill the area. We can fill the area under the line with the
geom_area() aesthetic - but note that it changes the y-axis
by default to start at 0:
stat_phd_year_summary %>%
ggplot(aes(x = year, y = n_phds)) +
# Fill the area under the line
geom_area(fill = "darkblue", alpha = 0.5) +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time")
You can also make this plot using the ggridges
package.
We’ll now switch to displaying the different Statistics fields
separately with the stats_phds dataset. First, we
should NOT display multiple time series with just points as
follows:
stats_phds %>%
ggplot(aes(x = year, y = n_phds, color = field)) +
geom_point() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
theme(legend.position = "bottom",
# Adjust the size of the legend's text
legend.text = element_text(size = 5),
legend.title = element_text(size = 6)) +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
It’s much simpler to just display the lines to compare the trends:
stats_phds %>%
ggplot(aes(x = year, y = n_phds, color = field)) +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
theme(legend.position = "bottom",
# Adjust the size of the legend's text
legend.text = element_text(size = 5),
legend.title = element_text(size = 6)) +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
The legend is pretty cluttered though, instead we can directly label
the displayed lines using the ggrepel
package. We first need to create a dataset with just the final
values (which in this case corresponds to year == 2017),
and then add labels for these values. To make the labels visible, we
need to increase our x-axis limits. Note that this is a “hack”, but you
will rely on hacks to customize visuals in the future… The following
code chunk demonstrates how to do this:
stats_phds_2017 <- stats_phds %>%
filter(year == 2017)
# Access the ggrepel package:
# install.packages("ggrepel")
library(ggrepel)
stats_phds %>%
ggplot(aes(x = year, y = n_phds, color = field)) +
geom_line() +
# Add the labels:
geom_text_repel(data = stats_phds_2017,
aes(label = field),
size = 2,
# Drop the segment connection:
segment.color = NA,
# Move labels up or down based on overlap
direction = "y",
# Try to align the labels horizontally on the left hand side
hjust = "left") +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year),
# Update the limits so that there is some padding on the
# x-axis but don't label the new maximum
limits = c(min(stat_phd_year_summary$year),
max(stat_phd_year_summary$year) + 3)) +
theme_bw() +
# Drop the legend
theme(legend.position = "none") +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
Next, let’s switch to back to the original dataset
phd_field. What happens if we plot a line for every field
attempting to use the color aesthetic to separate them?
phd_field %>%
ggplot(aes(x = year, y = n_phds, color = field)) +
geom_line() +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
theme(legend.position = "none") +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
## Warning: Removed 270 rows containing missing values (`geom_line()`).
The plot above is obviously a disaster… When we are dealing with
potentially way too many categories, we can instead highlight lines of
interest while setting the background lines to gray, so we can still see
background trends. We need to use the group aesthetic to
split the gray lines from each other. Plus, we should adjust the alpha
due to the overlap. The following code chunk demonstrates how to do this
for highlighting the “Statistics (mathematics)” and “Biometrics and
biostatistics” lines. We essentially create separate plot layers by
filtering on the field variable:
# First display the background lines using the full dataset with those two fields
# filtered out:
phd_field %>%
# The following line says: NOT (field in c("Biometrics and biostatistics", "Statistics (mathematics)"))
filter(!(field %in% c("Biometrics and biostatistics",
"Statistics (mathematics)"))) %>%
ggplot() +
# Add the background lines - need to specify the group to be the field
geom_line(aes(x = year, y = n_phds, group = field),
color = "gray", size = .5, alpha = .5) +
# Now add the layer with the lines of interest:
geom_line(data = filter(phd_field,
# Note this is just the opposite of the above since ! is removed
field %in% c("Biometrics and biostatistics",
"Statistics (mathematics)")),
aes(x = year, y = n_phds, color = field),
# Make the size larger
size = .75, alpha = 1) +
scale_x_continuous(breaks = unique(stat_phd_year_summary$year),
labels = unique(stat_phd_year_summary$year)) +
theme_bw() +
theme(legend.position = "bottom",
# Drop the panel lines making the gray difficult to see
panel.grid = element_blank()) +
labs(x = "Year", y = "Number of PhDs",
title = "Number of Statistics-related PhDs awarded over time",
color = "Field")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## Warning: Removed 270 rows containing missing values (`geom_line()`).
Another way to visualize time series data is to display it in a cycle
pattern, using polar coordinates, as done by Florence Nightingale’s
famous rose diagram. We can recreate the rose diagram by accessing the
data in the HistData package. We’ll first load and print
out the first so many rows of the data below:
library(HistData)
head(Nightingale)
## Date Month Year Army Disease Wounds Other Disease.rate Wounds.rate
## 1 1854-04-01 Apr 1854 8571 1 0 5 1.4 0.0
## 2 1854-05-01 May 1854 23333 12 0 9 6.2 0.0
## 3 1854-06-01 Jun 1854 28333 11 0 6 4.7 0.0
## 4 1854-07-01 Jul 1854 28722 359 0 23 150.0 0.0
## 5 1854-08-01 Aug 1854 30246 828 1 30 328.5 0.4
## 6 1854-09-01 Sep 1854 30290 788 81 70 312.2 32.1
## Other.rate
## 1 7.0
## 2 4.6
## 3 2.5
## 4 9.6
## 5 11.9
## 6 27.7
To recreate the plot, we’ll need to first make a longer version of
the dataset with the Disease, Wounds, and
Other columns separated into three rows. To do that, we’ll
use the pivot_longer() function after just selecting the
columns of interest for our plot:
crimean_war_data <- Nightingale %>%
dplyr::select(Date, Month, Year, Disease, Wounds, Other) %>%
# Now pivot those columns to take up separate rows:
pivot_longer(Disease:Other,
names_to = "cause", values_to = "count")
Next, we’ll make a label column matching Nightingale’s plot based on
the Date column. We’ll talk about dates more in next
lecture, but we can condition on being above or below certain dates in a
natural way:
crimean_war_data <- crimean_war_data %>%
mutate(time_period = ifelse(Date <= as.Date("1855-03-01"),
"April 1854 to March 1855",
"April 1855 to March 1856"))
And finally we can go ahead and display the rose diagram facetted by the time period (using similar colors to Nightingale):
crimean_war_data %>%
ggplot(aes(x = Month, y = count)) +
geom_col(aes(fill = cause), width = 1,
position = "identity", alpha = 0.5) +
coord_polar() +
facet_wrap(~ time_period, ncol = 2) +
scale_fill_manual(values = c("skyblue3", "grey30", "firebrick")) +
scale_y_sqrt() +
theme_void() +
# All of this below is to just customize the theme in a way that we are
# close to resembling the original plot (ie lets make it look old!)
theme(axis.text.x = element_text(size = 9),
strip.text = element_text(size = 11),
legend.position = "bottom",
plot.background = element_rect(fill = alpha("cornsilk", 0.5)),
plot.margin = unit(c(10, 10, 10, 10), "pt"),
plot.title = element_text(vjust = 5)) +
labs(title = "Diagram of the Causes of Mortality in the Army in the East")
This looks pretty close to the original diagram, except the order of the months does not match the original. We can of course change that by reordering the factor variable:
crimean_war_data %>%
# Manually relevel it to match the original plot
mutate(Month = fct_relevel(Month,
"Jul", "Aug", "Sep", "Oct", "Nov",
"Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun")) %>%
ggplot(aes(x = Month, y = count)) +
geom_col(aes(fill = cause), width = 1,
position = "identity", alpha = 0.5) +
coord_polar() +
facet_wrap(~ time_period, ncol = 2) +
scale_fill_manual(values = c("skyblue3", "grey30", "firebrick")) +
scale_y_sqrt() +
theme_void() +
# All of this below is to just customize the theme in a way that we are
# close to resembling the original plot (ie lets make it look old!)
theme(axis.text.x = element_text(size = 9),
strip.text = element_text(size = 11),
legend.position = "bottom",
plot.background = element_rect(fill = alpha("cornsilk", 0.5)),
plot.margin = unit(c(10, 10, 10, 10), "pt"),
plot.title = element_text(vjust = 5)) +
labs(title = "Diagram of the Causes of Mortality in the Army in the East")
How does this compare to just a simple line graph?
crimean_war_data %>%
ggplot(aes(x = Date, y = count, color = cause)) +
geom_line() +
# Add a reference line at the cutoff point
geom_vline(xintercept = as.Date("1855-03-01"), linetype = "dashed",
color = "gray") +
scale_color_manual(values = c("skyblue3", "grey30", "firebrick")) +
scale_y_sqrt() +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Diagram of the Causes of Mortality in the Army in the East",
y = "sqrt(counts)", x = "Date")
We can customize the x-axis further using scale_x_date():
crimean_war_data %>%
ggplot(aes(x = Date, y = count, color = cause)) +
geom_line() +
# Add a reference line at the cutoff point
geom_vline(xintercept = as.Date("1855-03-01"), linetype = "dashed",
color = "gray") +
scale_color_manual(values = c("skyblue3", "grey30", "firebrick")) +
scale_y_sqrt() +
# Format to use abbreviate month %b with year %Y
scale_x_date(date_labels = "%b %Y") +
theme_bw() +
theme(legend.position = "bottom") +
labs(title = "Diagram of the Causes of Mortality in the Army in the East",
y = "sqrt(counts)", x = "Date")
Which one do you prefer? Maybe filling the area under the lines would be better here…
gganimate to
add animationsBy far, the simplest way to create visualizations with animations is
to use the gganimate
package. This effectively works as an extension to
ggplot figures but with the inclusion of various
transition_* functions
First, let’s think about when you should NOT animate
a plot. We first create a visualization of the penguins
data from before, of bill length on the y-axis against the
body mass on the x-axis colored by
species:
library(palmerpenguins)
penguins %>%
ggplot(aes(x = body_mass_g, y = bill_length_mm, color = species)) +
geom_point(alpha = 0.5, size = 2) +
labs(x = "Body Mass (g)", y = "Bill Length (mm)") +
theme_bw()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Now, we could do the following: use the
gganimate package to only display one species
at a time with the transition_states() function:
library(gganimate)
penguins %>%
ggplot(aes(x = body_mass_g, y = bill_length_mm, color = species)) +
geom_point(alpha = 0.5, size = 2) +
labs(x = "Body Mass (g)", y = "Bill Length (mm)") +
theme_bw() +
transition_states(species,
transition_length = 0.5,
state_length = 1)
The use of transition_length and
state_length indicate how much relative time
should take place when transitioning between states and the pause at
each state, respectively. But the above use of animation is
useless!
So when should you consider using animation?
One appropriate usage is in the context of storytelling with data, to emphasize some aspect of your visual display. For instance, we’ll borrow this F1 racing dataset from Meghan Hall’s iteration of 36-315 to compare the performance of three racing teams:
# First load the data from Meghan's github
f1_data_ex <- read_csv('https://raw.githubusercontent.com/meghall06/CMU-36-315-site/main/data/constructor_pts.csv') %>%
filter(name %in% c("McLaren", "Renault", "Racing Point"),
year == 2020)
## Rows: 1260 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): name, race_name
## dbl (9): constructorStandingsId, raceId, constructorId, points, position, p...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Now display the results across the rounds:
f1_data_ex %>%
ggplot(aes(x = round, y = points, group = name, color = name)) +
geom_line(size = 2) +
scale_x_continuous(breaks = seq(1, 17, 1)) +
labs(title = "The race for third place in the 2020 F1 season",
y = "Accumulated points", x = NULL) +
theme_bw()
From above we can see the accumulated points increasing over time for
each team, with McLaren finishing better than both, Racing Point and
Renault, at the end. But we could incrementally reveal the results at
each stage emphasize the story of progression. We’re not adding another
dimension to the display, but we emphasize the intermediate results
through animation with the transition_reveal()
function:
f1_data_ex %>%
ggplot(aes(x = round, y = points, group = name, color = name)) +
geom_line(size = 2) +
scale_x_continuous(breaks = seq(1, 17, 1)) +
labs(title = "The race for third place in the 2020 F1 season",
y = "Accumulated points", x = NULL) +
theme_bw() +
# Reveal the results by round
transition_reveal(round)
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
The most effective use of animation is when it adds another dimension
to your visualization, typically in the form of time. The previous
visualization only animated across the x-axis - it did NOT add another
variable in our data. However, animation can let us bring in another
dimension so that we can see differences between relationships of
variables in various ways. You should watch Hans Rosling’s 200
Countries, 200 Years, 4 Minutes to see one example in action. We can
make similar visualizations with gganimate.
In the code chunk below, we’re going to display yearly summaries
about housing sales in TX (dataset comes loaded with
ggplot2). We’re going to plot the average number of active
listings and average median sale price for each city-year combination in
the data. For context, we’re going to highlight the data for Houston in
red with a larger point size:
# Load the scales package for better labeling of the axes
txhousing %>%
group_by(city, year) %>%
summarize(median = mean(median, na.rm = TRUE),
listings = mean(listings, na.rm = TRUE)) %>%
ggplot(aes(x = median, y = listings,
color = (city == "Houston"),
size = (city == "Houston"))) +
# Hide the legend for the point layer
geom_point(alpha = 0.5, show.legend = FALSE) +
# Manual color label
scale_color_manual(values = c("black", "darkred")) +
# Manual size adjustment
scale_size_manual(values = c(2, 4)) +
scale_x_continuous(labels = scales::dollar, name = "Median Price") +
scale_y_continuous(labels = scales::label_number_si()) +
theme_bw() +
labs(x = "Median Price", y = "Avg. of Monthly Listings",
subtitle = "Houston in red")
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
## Warning: `label_number_si()` was deprecated in scales 1.2.0.
## ℹ Please use the `scale_cut` argument of `label_number()` instead.
## Warning: Removed 68 rows containing missing values (`geom_point()`).
In the figure above we do not have year included in any way. But we
can use the transition_time() function to animate the
visual over time, while also updating the plot title to include the
displayed year:
# Load the scales package for better labeling of the axes
txhousing %>%
group_by(city, year) %>%
summarize(median = mean(median, na.rm = TRUE),
listings = mean(listings, na.rm = TRUE)) %>%
ggplot(aes(x = median, y = listings,
color = (city == "Houston"),
size = (city == "Houston"))) +
# Hide the legend for the point layer
geom_point(alpha = 0.5, show.legend = FALSE) +
# Manual color label
scale_color_manual(values = c("black", "darkred")) +
# Manual size adjustment
scale_size_manual(values = c(2, 4)) +
scale_x_continuous(labels = scales::dollar, name = "Median Price") +
scale_y_continuous(labels = scales::label_number_si()) +
theme_bw() +
labs(x = "Median Price", y = "Avg. of Monthly Listings",
subtitle = "Houston in red",
title = "Year: {frame_time}") +
transition_time(year)
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
From viewing the above visual, you can see how animation makes
changes appear more dramatic between years - versus plotting each year
separately with facets. We can then save the above animation as a GIF
with the anim_save("INSERT/FILEPATH") function, which will
save the last animation you made by default.
anim_save("examples/txhousing.gif")
Some key points to think about before adding animation to a visualization:
Always make and describe the original / base graphic first that does NOT include animation.
Before adding animation to the graph, ask yourself: How would animation give you additional insights about the data that you would otherwise not be able to?
Never add animation just because it’s cool!
When presenting, make sure you explain exactly what is being displayed with animation and what within the animation you want to emphasize. This will help you determine if animation is actually worth including.